Regression Assumptions

Neil Lund

2025-01-03

Why Linear Regression

Linear regression is at least a good starting point for a wide variety of social science questions:

  • Make a prediction about Y based on values of X (vote share ~ approval + GDP)

  • Identify a causal effect (% lung cancer deaths ~ % smokers)

  • Describe or quantify a relationship between Y and X (College GPA ~ SAT score)

Even where OLS is sub-optimal, it's a good starting point for learning the alternatives.

Correlation and regression: Galton’s peas

Galton’s results

Regression: peas

Covariance

\[\frac{\sum\limits_{i=1}^n (X_{i} - \bar{X}) (Y_{i} - \bar{Y}) }{n}\]

parent  parent - mean(parent)  offspring  offspring - mean(offspring)  product
     6                     -4          4                           -5       20
     6                     -4          7                           -2        8
     6                     -4         10                            1       -4
    10                      0          5                           -4        0
    10                      0          9                            0        0
    10                      0         13                            4        0
    14                      4          8                           -1       -4
    14                      4         11                            2        8
    14                      4         14                            5       20

covariance = 48 / 9 ≈ 5.3
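As a check, the table above can be reproduced in a couple of lines of R (note that R's built-in cov() divides by n − 1, so the population version is computed by hand here):

```r
parent    <- c(6, 6, 6, 10, 10, 10, 14, 14, 14)
offspring <- c(4, 7, 10, 5, 9, 13, 8, 11, 14)

# Population covariance: mean of the products of the deviations (divide by n)
sum((parent - mean(parent)) * (offspring - mean(offspring))) / length(parent)
# [1] 5.333333
```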

Correlation

\[\frac{cov(X, Y)}{\sigma_{x}\sigma_{y}} \]

parent SD  offspring SD  product            correlation
      3.3           3.2  3.3 × 3.2 = 10.56  5.3 / 10.56 ≈ .5

(note: correlation is constrained between -1 and 1 because it's standardized by the standard deviations of X and Y)
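In R, cor() gives this answer directly; the n − 1 adjustments in its covariance and standard deviations cancel, so it matches the population calculation above.

```r
parent    <- c(6, 6, 6, 10, 10, 10, 14, 14, 14)
offspring <- c(4, 7, 10, 5, 9, 13, 8, 11, 14)

# Covariance divided by the product of the standard deviations
cov(parent, offspring) / (sd(parent) * sd(offspring))
cor(parent, offspring)   # identical, about 0.51
```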

Slope

\[\frac{cov(X, Y)}{\sigma_{x}^2}\]

parent SD  parent SD squared  slope
      3.3  3.3^2 = 10.89      5.3 / 10.89 ≈ .5

(note: correlation and slope are nearly identical here, but they diverge when the standard deviations of X and Y differ)

Intercept

\[ \alpha = \bar{Y} - \beta\bar{X} \]

offspring mean  parent mean  intercept
             9           10  9 - (.5 × 10) = 4
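Putting the slope and intercept formulas together in R, and confirming against lm():

```r
parent    <- c(6, 6, 6, 10, 10, 10, 14, 14, 14)
offspring <- c(4, 7, 10, 5, 9, 13, 8, 11, 14)

b <- cov(parent, offspring) / var(parent)   # slope: 6 / 12 = 0.5
a <- mean(offspring) - b * mean(parent)     # intercept: 9 - 0.5 * 10 = 4

coef(lm(offspring ~ parent))
# (Intercept)      parent
#         4.0         0.5
```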

              Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)      4.000           3.346    1.195    0.2709
parent           0.500           0.318    1.572    0.1600

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 3.117 on 7 degrees of freedom

Multiple R-squared: 0.2609, Adjusted R-squared: 0.1553

F-statistic: 2.471 on 1 and 7 DF, p-value: 0.1600

The regression equation

\[ Y_{i} = \alpha + \beta X_{i} + u_i \]

  • \(X_i\) is a value of an IV
  • \(Y_i\) is a value of the DV
  • beta (\(\beta\)) is a coefficient (a slope, in geometry)
  • alpha (\(\alpha\)) is the constant (or y-intercept)
  • \(u_i\) is an error (its estimated counterpart in the sample is called a residual)

So how do we know we’ve got the best fit?

Gauss-Markov Theorem: BLUE

Provided we meet the necessary conditions, the estimator (the sample coefficient) is:

  • Best: it has the smallest sampling variance of any linear unbiased estimator

  • Linear: it is a linear function of the observed values of Y

  • Unbiased: its expected value equals the population parameter (errors cancel out on average)

Andrey Markov

Carl Gauss

Minimizing the error

                Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)       65.083           1.601   40.647    0.0000  ***
partisan_lean     -0.374           0.174   -2.152    0.0365  *

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 6.076 on 48 degrees of freedom

Multiple R-squared: 0.08798, Adjusted R-squared: 0.06898

F-statistic: 4.631 on 1 and 48 DF, p-value: 0.0365

Minimizing the error: example

A single error

The observed value is 63.2, so

\[u_{idaho} = 63.2 - 58.57 = 4.63\]

The RSS

\[RSS = \sum\limits_{i=1}^n (\hat{Y}_{i} - Y_i)^{2} = 1771.9\]

The RSS and the standard error of the model

\[ \hat{\sigma}^2 = \frac{RSS}{n-k} = \frac{1771.9}{48} = 36.91453 \]

The RSS and the standard error of beta

\[se(\hat{\beta}) = \sqrt{\frac{\sigma^{2}}{\sum_{i=1}^{n}(X_{i} - \bar{X})^2}} = \sqrt{\frac{36.91}{1219.572}} = 0.174\]
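These pieces can be verified with plain arithmetic in R, plugging in the RSS and the sum of squared deviations of the IV reported above:

```r
sigma2 <- 1771.9 / 48     # RSS / (n - k): estimated error variance
sqrt(sigma2 / 1219.572)   # standard error of the slope, approximately 0.174
```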

The best RSS

By construction, any other version of this line would have a larger RSS (thus larger errors and less accurate predictions). The Gauss-Markov theorem adds that, under the conditions below, the OLS line also has the smallest sampling variance of any linear unbiased estimator.

states <- poliscidata::states
partisan_lean <- abs(states$cook_index)
turnout <- states$vep04_turnout
model <- lm(turnout ~ partisan_lean)

# The OLS estimate:
t_hat <- predict(model)
sum((t_hat - turnout)^2)
[1] 1771.897

# An alternative line with a slightly different slope:
t_hat_2 <- 65.0834 + -0.4 * partisan_lean
sum((t_hat_2 - turnout)^2)
[1] 1774.678

The conditions for BLUE

The OLS estimate is only “best” under certain conditions. So what are they?

  • Linearity

  • Homoscedasticity

  • Exogeneity

  • Limited multicollinearity

Linearity

Interpretation 1: Y must be a linear function of X

There’s clearly something more complicated than a linear relationship here:

That said, we can make this a linear relationship by squaring X:
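A simulated sketch of that fix (X and Y here are made-up data, not from the slides): the true relationship is quadratic, but regressing Y on the transformed variable X² turns it into a linear one.

```r
set.seed(42)
X <- runif(100, 0, 10)
Y <- 2 + 0.5 * X^2 + rnorm(100, sd = 2)   # the true relationship is quadratic

lm(Y ~ X)        # misspecified: forces a straight line through a curve
lm(Y ~ I(X^2))   # linear in the transformed regressor, so OLS is fine
```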

Linearity in the parameters

Non-linearity in the parameters, by contrast, means that the model is not a linear function of the \(\beta\) coefficients themselves. For instance, in a logistic regression model, the part of the model we want to estimate is itself non-linear.

\[P(Y_i=1) = \frac{1}{1+e^{-(\beta_0 + \beta_1X_i...)}}\]

We can transform the stuff we’ve observed, but we can’t transform a parameter before we estimate it. So this is going to require a different approach.
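In R, that different approach is glm(), which fits the logistic model by maximum likelihood rather than least squares (simulated data for illustration):

```r
set.seed(42)
X <- rnorm(200)
Y <- rbinom(200, 1, plogis(-0.5 + 1.5 * X))   # simulated binary outcome

glm(Y ~ X, family = binomial(link = "logit"))
```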

Homoscedasticity

OLS assumes that the error term \(\epsilon\) has a constant variance across levels of the IV.

The residuals from this model are pretty constant as we move from higher to lower expected values of Y

But in this case, we see heteroscedasticity: the variance changes as Y gets big.

Homoscedasticity in practice

  • Data: Chapel Hill Expert Survey 1999-2019 trend file

  • DV: party positioning on immigration (as measured by area experts)

  • IV: % seats in lower house of parliament

Homoscedasticity in practice

  • Data: Current Population Survey 2007

  • DV: Hourly Earnings

  • IV: Years of Education

So what should be done here?

Don’t overreact! It’s mostly a problem at small sample sizes.

  • Heteroscedasticity can point to other problems with model specification that you want to address. So you should consider things like:

    • Omitted variables

    • Non-linear relationships

    • Outliers or skew

  • Heteroskedasticity-consistent standard errors are an easy fix and viewed as basically “cost free”
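A sketch of that fix using the sandwich and lmtest packages, on simulated heteroscedastic data: the coefficient estimates are unchanged, only the standard errors are recomputed.

```r
library(sandwich)
library(lmtest)

set.seed(42)
X <- runif(200, 1, 10)
Y <- 1 + 0.5 * X + rnorm(200, sd = X)   # error variance grows with X

model <- lm(Y ~ X)
coeftest(model, vcov = vcovHC(model, type = "HC3"))   # robust standard errors
```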

No perfect multicollinearity

For multiple regression, it must be feasible to estimate the effect of each IV independently of the other IVs.

set.seed(100)

X  <- rnorm(100)
X2 <- X            # X2 is a perfect copy of X
Y  <- 1 + X * 2

lm(Y ~ X + X2)

Call:
lm(formula = Y ~ X + X2)

Coefficients:
(Intercept)            X           X2  
          1            2           NA  

Or imperfect multicollinearity

library(faux)       # rnorm_multi()
library(flextable)  # as_flextable()
library(magrittr)   # %>%

N <- 100
set.seed(100)
data <- rnorm_multi(n = N,
                    mu = c(0, 0, 0),
                    sd = c(1, 1, 1),
                    r = c(0.99, 0.99, 0.99),
                    varnames = c("X1", "X2", "X3"),
                    empirical = FALSE)

data$Y <- data$X1 + data$X2 + data$X3 + rnorm(N)

lm(Y ~ X1 + X2 + X3, data = data) %>% as_flextable()

              Estimate  Standard Error  t value  Pr(>|t|)
(Intercept)     -0.083           0.109   -0.763    0.4475
X1              -0.078           0.838   -0.093    0.9264
X2               1.259           1.130    1.115    0.2678
X3               1.757           1.130    1.556    0.1230

Signif. codes: 0 <= '***' < 0.001 < '**' < 0.01 < '*' < 0.05

Residual standard error: 1.087 on 96 degrees of freedom

Multiple R-squared: 0.8879, Adjusted R-squared: 0.8844

F-statistic: 253.4 on 3 and 96 DF, p-value: 0.0000

Or imperfect multicollinearity

When we include a control variable, we’re essentially “taking out” the portions of X1 and Y that are correlated with X2 and X3.
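That "taking out" can be shown directly via the Frisch-Waugh-Lovell theorem: regress Y and X1 each on the controls, then regress the residuals on each other (simulated correlated predictors for illustration):

```r
set.seed(42)
X2 <- rnorm(100)
X3 <- rnorm(100)
X1 <- 0.5 * X2 + 0.5 * X3 + rnorm(100)
Y  <- X1 + X2 + X3 + rnorm(100)

# Residualize Y and X1 with respect to the controls
r_x1 <- resid(lm(X1 ~ X2 + X3))
r_y  <- resid(lm(Y ~ X2 + X3))

coef(lm(r_y ~ r_x1))[["r_x1"]]         # identical to...
coef(lm(Y ~ X1 + X2 + X3))[["X1"]]
```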

Implications of multicollinearity

\[ RSS = \sum\limits_{i=1}^n (\hat{Y}_{i} - Y_i)^{2} \]

\[ \hat{\sigma}^2 = \frac{RSS}{n-k} \]

\[ se(\hat{\beta}) = \sqrt{\frac{\sigma^{2}}{\sum_{i=1}^{n}(X_{i} - \bar{X})^2}} \]
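In multiple regression, the denominator of the last formula for predictor \(j\) is effectively multiplied by \(1 - R_j^2\), where \(R_j^2\) comes from regressing \(X_j\) on the other predictors, so collinearity inflates the standard error. That inflation factor can be computed by hand (simulated data for illustration):

```r
set.seed(42)
X2 <- rnorm(100)
X1 <- X2 + rnorm(100, sd = 0.1)   # X1 is nearly a copy of X2
Y  <- X1 + X2 + rnorm(100)

# Variance inflation factor for X1: 1 / (1 - R_j^2)
r2 <- summary(lm(X1 ~ X2))$r.squared
1 / (1 - r2)   # very large; car::vif(lm(Y ~ X1 + X2)) gives the same number
```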

Regression Outliers

Not a requirement for BLUE, but related!

What is a regression outlier?

  • Conditionally unusual observations: not just high values of Y or X but higher than expected values of Y given X

  • Related to problems with skew, but you can have skew without outliers and vice versa

  • Outliers in small samples can undermine the normality and constant variance assumptions, but we’re usually most concerned about their potential to influence results.

Conditional unusualness

  • Data: Varieties of Democracy (V-DEM)
  • DV: Liberal-Democracy Score in most recent year available
  • IV: per-capita GDP in the same year

Red is the original model, black is the model after excluding the two observations highlighted in red

What’s going on here? Does it make sense to drop these observations?

Cook’s Distance

How big is the problem?

One way to quantify an outlier is to calculate the effect that dropping an observation has on the results. Cook's D for observation i is:

\[D_i = \frac{\sum^n_{j=1}(\hat{Y_j} - \hat{Y_{j(i)}})^2}{MSE}\]

  • \(\hat{Y_j}\) the prediction for observation j from the full regression

  • \(\hat{Y_{j(i)}}\) the prediction for observation j after eliminating observation i

  • \(MSE\) the mean squared error of the regression

So, higher values mean that observation i is exerting more influence on our results.

A rule of thumb suggests examining outliers that have a D value over \(4/n\) where \(n\) is the number of observations
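A minimal sketch of that rule in R, with one planted high-leverage point in simulated data:

```r
set.seed(42)
X <- rnorm(50)
Y <- 1 + 2 * X + rnorm(50)
X[1] <- 5; Y[1] <- 1          # one point far from the line, with high leverage
model <- lm(Y ~ X)

d <- cooks.distance(model)
which(d > 4 / length(Y))      # flags observations worth examining
```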

Cook’s Distance

So what should be done here?

Don’t just drop stuff! First consider:

  • Check for errors. (If some joker told a survey taker they were 120 years old, you can probably justify dropping that response, but then you should check for other bad responses.)

  • Consider transforming variables. For continuous positive variables, taking the natural log can make the distribution far closer to normal.

  • Model it. This is especially worthwhile if you can come up with a systematic explanation for why certain observations are unusual.

So what should be done here?

Here’s the result after logging GDP and including a control variable for logged oil revenues
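A sketch of that final specification (simulated stand-ins for illustration; the real analysis uses the V-DEM liberal-democracy score, per-capita GDP, and oil revenues):

```r
set.seed(42)
gdp_pc  <- exp(rnorm(150, mean = 9, sd = 1))   # right-skewed, like real GDP
oil_rev <- rexp(150, rate = 0.1)
libdem  <- 0.05 * log(gdp_pc) - 0.01 * log(oil_rev + 1) + rnorm(150, sd = 0.1)

# Logging the skewed predictors tames the outliers before fitting
lm(libdem ~ log(gdp_pc) + log(oil_rev + 1))
```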